fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex #22273
Open
Dandandan wants to merge 1 commit into
`SIMILAR TO` was lowered directly to a regex match without translating SQL wildcards or anchoring the pattern, so `'abc' SIMILAR TO 'a%'` returned false instead of true and `.`/`^`/`$` were treated as regex metacharacters instead of literals. The planner now translates literal `SIMILAR TO` patterns into an equivalent POSIX regex (anchored with `^...$`, `%`→`.*`, `_`→`.`, literal `.`/`^`/`$` escaped, backslash escape and bracket expressions preserved) before lowering to the existing regex-match operator. NULL patterns flow through as a typed Utf8 null. Non-literal patterns now return a clear `not_impl` error rather than a silently wrong result, and the SQL-layer pattern-type check is widened to accept `LargeUtf8` and `Utf8View` literals. Closes apache#22263. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
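The translation rules in the commit message can be illustrated with a short, std-only sketch. This is not the PR's `translate_similar_to_pattern()`; the function name `similar_to_regex`, the handling of a trailing backslash, and the choice to wrap the body in a non-capturing group (so a top-level `|` stays inside the anchors) are assumptions made for illustration only.

```rust
/// Hypothetical sketch: turn a SQL `SIMILAR TO` pattern into an anchored regex string.
/// Assumption: the body is wrapped in `(?:...)` so that `a|b` becomes `^(?:a|b)$`.
fn similar_to_regex(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 6);
    out.push_str("^(?:");
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // SQL wildcards become their regex equivalents.
            '%' => out.push_str(".*"),
            '_' => out.push('.'),
            // Literal in SIMILAR TO, metacharacters in regex: escape them.
            '.' | '^' | '$' => {
                out.push('\\');
                out.push(c);
            }
            // Backslash escapes the next character; the pair is copied as-is.
            '\\' => {
                out.push('\\');
                // Assumption: a trailing lone backslash is kept as a literal backslash.
                out.push(chars.next().unwrap_or('\\'));
            }
            // Bracket expression: copy verbatim up to the closing `]`,
            // allowing a leading `^` and a leading literal `]`.
            '[' => {
                out.push('[');
                if chars.peek() == Some(&'^') {
                    out.push('^');
                    chars.next();
                }
                if chars.peek() == Some(&']') {
                    out.push(']');
                    chars.next();
                }
                for b in chars.by_ref() {
                    out.push(b);
                    if b == ']' {
                        break;
                    }
                }
            }
            // Everything else (`|`, `*`, `+`, `?`, `(`, `)`, `{m,n}`, plain text) passes through.
            other => out.push(other),
        }
    }
    out.push_str(")$");
    out
}
```

Under these assumptions, `similar_to_regex("a%")` yields `^(?:a.*)$` and `similar_to_regex("b")` yields `^(?:b)$`, which is why `'abc' SIMILAR TO 'b'` no longer matches once the pattern is anchored.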
Which issue does this PR close?
`SIMILAR TO` should treat `%` as a wildcard #22263.

Rationale for this change
`SIMILAR TO` is supposed to mix SQL LIKE wildcards (`%`, `_`) with POSIX regex metacharacters and match the entire string. DataFusion was lowering `SIMILAR TO` directly to a `RegexMatch` over the pattern verbatim, so:
- `'abc' SIMILAR TO 'a%'` returned `false` (PG: `true`) because `%` was passed through as a literal regex character.
- `'abc' SIMILAR TO 'b'` returned `true` (PG: `false`) because the regex match isn't anchored.
- `.`, `^`, `$` were treated as regex metacharacters instead of literals.

What changes are included in this PR?
- `translate_similar_to_pattern()` helper in `datafusion/physical-expr/src/expressions/binary.rs` that converts a `SIMILAR TO` pattern into an equivalent POSIX regex:
  - Anchors the pattern with `^...$`.
  - Translates `%` → `.*` and `_` → `.`.
  - Escapes `.`, `^`, `$` (literal in SIMILAR TO, meta in regex).
  - Passes the shared regex metacharacters (`|`, `*`, `+`, `?`, `()`, `{m,n}`, `[...]`) through unchanged.
  - Treats `\` as an escape for the next character.
  - Preserves bracket expressions, including a leading `]`/`^]`.
- The physical planner (`datafusion/physical-expr/src/planner.rs`) now translates literal `Utf8`/`LargeUtf8`/`Utf8View` patterns at planning time. NULL patterns flow through as a typed `Utf8` null. Non-literal patterns return a clear `not_impl_err` rather than the previous silently-wrong behavior.
- The SQL layer (`datafusion/sql/src/expr/mod.rs`) pattern-type check is widened to accept `LargeUtf8` and `Utf8View` literals (previously rejected even though the underlying regex match supports them).

Are these changes tested?
Yes:
- `similar_to_pattern_translation` in `binary.rs` covers wildcards, anchoring, regex metas, literal `.`/`^`/`$`, backslash escapes, and bracket expressions (including `[]abc]` and `[^]abc]`).
- `datafusion/sqllogictest/test_files/strings.slt` has a new regression block exercising the bug-report case, the `_` wildcard, anchoring, literal `.`/`^`/`$`, regex metas (`|`, `{m,n}`, `+`), backslash-escaped wildcards, a `NULL` pattern, and the non-literal-pattern error.
- The `SIMILAR TO 'p[12].*'` test in `strings.slt` was relying on the buggy regex-passthrough behavior (`p1e1` etc. matched only because `.` was treated as regex `.`); it has been changed to `'p[12]%'`, which expresses the same intent under correct SIMILAR TO semantics.

Are there any user-facing changes?
Yes. `SIMILAR TO` is now PostgreSQL-compatible:
- Patterns involving `%`, `_`, or unanchored matches now follow PostgreSQL semantics. Existing patterns that relied on the previous regex-passthrough behavior may need to be updated (most obviously, change `.*` to `%`).
- Non-literal patterns now fail with `Not yet implemented: SIMILAR TO with a non-literal pattern is not yet supported` instead of silently returning a regex match. This was almost certainly broken in practice already, but it is a visible error message change.
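The three planning outcomes described in this PR (literal pattern, NULL pattern, non-literal pattern) can be modeled with a small stand-alone sketch. The `Pattern` enum and `plan_similar_to` function are hypothetical stand-ins, not DataFusion types, and the translation step is passed in as a closure so the sketch stays independent of any particular implementation.

```rust
/// Hypothetical stand-in for the pattern expression seen at planning time.
enum Pattern {
    Literal(String),
    Null,
    NonLiteral,
}

/// Sketch of the dispatch: Ok(Some(regex)) for a literal, Ok(None) for NULL,
/// and an error for anything that is not a literal.
fn plan_similar_to(
    pattern: Pattern,
    translate: impl Fn(&str) -> String,
) -> Result<Option<String>, String> {
    match pattern {
        // Literal patterns are translated once, at planning time.
        Pattern::Literal(p) => Ok(Some(translate(&p))),
        // NULL patterns flow through without translation.
        Pattern::Null => Ok(None),
        // Non-literal patterns are rejected instead of being matched incorrectly.
        Pattern::NonLiteral => {
            Err("SIMILAR TO with a non-literal pattern is not yet supported".to_string())
        }
    }
}
```

Rejecting the non-literal case up front trades coverage for correctness: a column-valued pattern would need per-row translation, which this PR leaves unimplemented.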